What Knowledge Gets Distilled in Knowledge Distillation?
Utkarsh Ojha, Yuheng Li, Anirudh Sundara Rajan, Yingyu Liang, Yong Jae Lee
University of Wisconsin-Madison
Knowledge distillation aims to transfer useful information from a teacher network to a student network, with the primary goal of improving the student's performance on the task at hand. Over the years, there has been a deluge of novel techniques and use cases of knowledge distillation. Yet, despite the various improvements, there seems to be a glaring gap in the community's fundamental understanding of the process. Specifically, what is the knowledge that gets distilled in knowledge distillation? In other words, in what ways does the student become similar to the teacher?
Experimental Setup
We provide an extended version of the Experimental Setup from Section 5 below.

Linear Model. This domain involves learning a linear model when the underlying mapping between features and predictions is cubic. Concretely, the aim is to choose the top $B = 1$ out of $N = 50$ resources using a linear model. The fact that the features can be seen as 1-dimensional allows us to visualize the learned models (as seen in Figure 4).

Predict: Given a feature $x_n \sim U[0, 1]$, use a linear model to predict the utility $\hat{y}_n$ of choosing resource $n$, where the true utility is given by $y_n = 10x_n^3 - 6.5x_n$.
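The Linear Model domain can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the utility is read as 10x^3 - 6.5x (the operator between the two terms is taken to be a minus sign), the linear model is fit by ordinary least squares (the training loss is an assumption here), and the helper names `true_utility`, `fit_linear`, and `choose_top` are hypothetical.

```python
import numpy as np

def true_utility(x):
    # Cubic ground-truth utility, read here as 10x^3 - 6.5x (assumed sign).
    return 10 * x**3 - 6.5 * x

def fit_linear(x, y):
    # Least-squares fit of y ~ w * x + b (assumed training loss: MSE).
    A = np.stack([x, np.ones_like(x)], axis=1)
    (w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return w, b

def choose_top(scores, budget=1):
    # Indices of the `budget` highest-scoring resources, best first.
    return np.argsort(scores)[-budget:][::-1]

rng = np.random.default_rng(0)
N, B = 50, 1                       # N resources, choose the top B
x = rng.uniform(0.0, 1.0, size=N)  # features x_n ~ U[0, 1]
y = true_utility(x)                # true utilities y_n

w, b = fit_linear(x, y)
y_hat = w * x + b                  # predicted utilities

chosen = choose_top(y_hat, B)      # decision made from the predictions
best = choose_top(y, B)            # decision made with the true utilities
regret = y[best].sum() - y[chosen].sum()
print("slope:", w, "chosen:", chosen, "regret:", regret)
```

Because a linear model is monotone in the 1-dimensional feature, its top choice is always the resource with either the largest or the smallest feature value, which is what makes the learned models easy to visualize in this domain.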
Appendix Table of Contents
There are several key limitations of the MADE algorithm:
1. As mentioned in Section 3.1, the MADE algorithm can only mask neural networks such that they respect the autoregressive property.
The non-deterministic MADE masking algorithm presented in Germain et al. [2015], the resulting
Proposition 1 formalizes this point.
In Section 3.1, we showed that finding the weight masks for each neural network layer is equivalent
Figure 7 provides a visual example of the steps performed by Algorithm 1.
's last row, we need the products of the last row of
Randomly generated adjacency structures of 15 dimensions. IP gives better objective values when the adjacency matrix is very sparse.
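For reference, the autoregressive masking that the limitation on MADE refers to can be sketched as follows. This is a minimal illustration of the standard degree-based mask construction from Germain et al. [2015] for a single hidden layer with the natural input ordering; it is not the algorithm proposed in this paper, and the function name `made_masks` and the sizes used are hypothetical.

```python
import numpy as np

def made_masks(D, H, rng):
    # Degrees: inputs and outputs are numbered 1..D; each hidden unit gets
    # a random degree in [1, D-1].
    in_deg = np.arange(1, D + 1)
    hid_deg = rng.integers(1, D, size=H)  # upper bound is exclusive
    out_deg = np.arange(1, D + 1)
    # Hidden unit k may see input d only if deg(k) >= d.
    mask_in = (hid_deg[:, None] >= in_deg[None, :]).astype(int)   # shape (H, D)
    # Output d may see hidden unit k only if d > deg(k).
    mask_out = (out_deg[:, None] > hid_deg[None, :]).astype(int)  # shape (D, H)
    return mask_in, mask_out

rng = np.random.default_rng(0)
mask_in, mask_out = made_masks(D=5, H=16, rng=rng)

# Entry [i, j] counts the paths from input j to output i. The autoregressive
# property requires this matrix to be strictly lower triangular.
connectivity = mask_out @ mask_in
print(connectivity)
```

Whatever hidden degrees are sampled, output d can only be reached from inputs with index strictly less than d, so this construction cannot express any other dependency structure.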
Supplementary Material for "Hierarchical Adaptive Value Estimation for Multi-modal Visual Reinforcement Learning"
Section C describes the details of the experimental setup, including network architectures, hyperparameters, and hardware details. This outcome emphasizes the necessity of feature interaction or feature fusion to tackle intricate situations. Furthermore, an amalgamation of feature fusion and value fusion can offer better performance. This adjustment allows us to evaluate the robustness and adaptability of our approach in handling a larger number of vehicles in the environment. As we increase the number of vehicles on the road, Fig. A2 (a) clearly indicates that HAVE consistently delivers the highest performance. The training and testing curves of HAVE and other comparable methods are given in A4.